mmg_233_2013_genetics_genomicswikiaorg-20200214-history
Sequence Alignment
Overview Sequence alignment is a method of arranging two or more nucleotide or protein strands to find similar regions. It is used by tools such as BLAST, BWA, and Bowtie. There are two important types of sequence alignment: Needleman-Wunsch global alignment and Smith-Waterman local alignment. A global alignment forces the entire strands to match each other, a local alignment normally matches one strand to part of another. The algorithm can be used to determine similarity and homology between multiple sequences, to align sequence(s) to a reference genome, to construct a reference genome, and to find SNPs. Alignment to a reference genome is normally a first step in SNP and structural variant identification Sequence Alignment. (n.d.). In Wikipedia. Retrieved August 10, 2004, from http://en.wikipedia.org/wiki/Sequence_alignmentHowe, R. C. (2013, 10 April). "Methods in bioinformatics notes". Retrieved from http://www.uvm.edu/~rchowe/notes/CS232/. Usage in Genomic Sequencing Sequence alignment is important in large applications because it allows longer sequences to be constructed from shorter "reads", and if aligned to a reference genome, it allows genomic coordinates to be assigned to each base position. This allows a whole genome to be constructed out of a number of shorter reads. The most current human reference genome (hg19) is 3,137,161,264 bp long and most sequencing reads are about 100 bp long, and accurate whole-genome and whole-exome sequencing can be achieved with a much lower coverage if the reads are aligned to a reference genome. Algorithm The algorithms for Needleman-Wunsch and Smith-Waterman sequence alignment are very similar. Each algorithm is controlled by three parameters: * A match score * A mismatch score * A gap penalty For two sequences x and y, with the ith residue in x represented by xi, the Needleman-Wunsch algorithm uses the following score calculation: F(0,0) = 0 F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j)\\ F(i-1,j) - d\\ f(i,j-1) - d \end{cases} where d is the gap penalty and s(xi, yi) is the score of the matching of the bases at xi and yi. The best alignment is then the path from (i, j) = (0,0) to (m,n) with the highest total score (by adding each score). For a visual representation of this, see a sequence-alignment tool I wrote for MMG 232. The following is an example alignment of ATTGGCC and CGATTCC: * A T T G G C C * 0 -0.5 -1.0 -1.5 -2.0 -2.5 -3.0 -3.5 C -0.5 -1.0 -1.5 -2.0 -2.5 -3.0 -1.5 -2.0 G -1.0 -1.5 -2.0 -2.5 -1.0 -1.5 -2.0 -2.5 A -1.5 0.0 -0.5 -1.0 -1.5 -2.0 -2.5 -3.0 T -2.0 -0.5 1.0 0.5 0.0 -0.5 -1.0 -1.5 T -2.5 -1.0 0.5 2.0 1.5 1.0 0.5 0.0 C -3.0 -1.5 0.0 1.5 1.0 0.5 2.0 1.5 C -3.5 -2.0 -0.5 1.0 0.5 0.0 1.5 3.0 The aligned sequence for this will then be: --ATTGGCC CGATT--CC The Smith-Waterman algorithm is the same except that the sum path can start and end at any (i,j) and the optimality substructure F(i,j) is F(i,j) = \max \begin{cases} 0\\ F(i-1,j-1) + s(x_i,y_j)\\ F(i-1,j) - d\\ f(i,j-1) - d \end{cases} Trivia * This is the same algorithm used by spell check. References